Uisge Beatha: Water of Life or Watered Down? Applied Multivariate Methods in Scotch Differentiation

Data Used

Whisky Origin and Chemical Data
Sample_no Descriptor Distillery P S Cl K Ca Mn Fe Cu Zn Br Rb
1 Blend Baile Nicol Jarvie 0.152 1.100 0.173 7.860 1.450 0.032 0.027 0.186 0.015 0.002 0.006
2a Blend Bells 0.653 1.580 0.238 4.930 1.400 0.019 0.110 0.242 0.021 0.005 0.003
3a Blend Chivas 0.375 0.809 0.193 4.310 1.220 0.019 0.044 0.196 0.007 0.003 0.002
4a Blend Dewars 0.121 1.160 0.157 3.200 1.140 0.011 0.050 0.189 0.018 0.003 0.003
5a Blend Johnnie Walker 0.326 1.090 0.180 5.480 0.526 0.018 0.103 0.286 0.020 0.002 0.002
6a Blend The Famous Grouse 0.145 0.615 0.097 2.740 0.416 0.009 0.050 0.208 0.007 0.002 0.001
7a Blend Whyte and Mackay 0.067 0.576 0.151 2.360 0.745 0.012 0.047 0.159 0.019 0.003 0.002
8a Blend William Grant 0.239 0.748 0.147 2.840 0.976 0.010 0.021 0.137 0.020 0.003 0.002
9a Counterfeit Unknown 1 0.089 4.060 0.066 0.336 1.240 0.007 0.154 0.085 0.038 0.005 0.001
10a Counterfeit Unknown 2 0.088 14.700 0.072 1.230 1.400 0.006 0.025 0.052 0.018 0.004 0.001
11a Counterfeit Unknown 3 0.279 15.900 0.083 0.811 1.360 0.006 0.057 0.038 0.016 0.002 0.002
12a Counterfeit Unknown 4 0.320 22.100 0.596 2.320 1.780 0.008 0.019 0.038 0.015 0.068 0.001
13a Counterfeit Unknown 5 0.120 26.100 0.071 2.370 1.630 0.010 0.082 0.187 0.194 0.012 0.005
14a Grain Grain matured 0.034 2.230 0.252 6.440 1.040 0.013 0.115 0.174 0.019 0.004 0.006
15a Grain Grain unmatured 0.084 5.530 0.113 3.250 1.350 0.012 0.076 0.164 0.046 0.010 0.003
16 Highland Glengoyne 1.040 5.570 0.343 24.200 0.857 0.023 0.197 1.251 0.041 0.004 0.016
17 Highland Glenmorangie 0.126 0.796 0.245 6.950 0.859 0.035 0.025 0.523 0.011 0.003 0.006
18a Island Bowmore 0.914 6.670 0.316 21.100 0.868 0.037 0.148 0.548 0.032 0.007 0.018
19 Island Bruichladdich 1.630 5.480 0.697 36.500 4.130 0.038 0.288 0.587 0.066 0.034 0.039
20a Island Bunnahabhain 2.240 7.540 1.350 36.200 2.120 0.051 0.184 0.580 0.057 0.014 0.037
21 Island Talisker 0.034 4.850 0.362 5.670 0.607 0.018 0.070 0.277 0.033 0.003 0.006
22a Lowland Auchentoshan 0.169 1.460 0.417 11.700 0.681 0.042 0.128 1.320 0.037 0.006 0.012
23a Lowland Glenkinchie 0.108 2.450 0.176 7.760 0.738 0.031 0.106 0.434 0.022 0.002 0.007
24 Speyside Balvenie 0.695 3.850 0.120 20.300 0.765 0.031 0.121 0.380 0.035 0.005 0.024
25 Speyside Craigellachie 0.096 0.819 0.177 6.110 0.633 0.024 0.094 0.239 0.025 0.005 0.006
26 Speyside Dufftown 0.883 4.640 0.130 14.000 1.050 0.030 0.078 0.533 0.024 0.002 0.014
27 Speyside Glen Elgin 0.115 1.350 0.404 9.270 1.400 0.031 0.046 0.195 0.029 0.006 0.009
28 Speyside Glenburgie 2.000 7.910 0.185 37.700 1.650 0.053 0.134 0.198 0.043 0.008 0.026
29 Speyside Glenfiddich 0.317 2.720 0.344 12.400 0.660 0.029 0.132 0.519 0.193 0.004 0.013
30 Speyside Glenrothes 0.953 4.110 0.399 16.700 1.830 0.041 0.137 1.030 0.029 0.007 0.014
31 Speyside Knockando 0.051 1.030 0.191 5.140 0.605 0.017 0.094 0.432 0.020 0.008 0.005
32 Speyside Linkwood 0.276 1.050 0.207 6.220 1.010 0.020 0.064 0.769 0.019 0.004 0.006

Summary of Trace Elements

Summary Statistics for Whisky Chemical Variables
Variable Mean Median SD Min Max
P 0.461 0.204 0.575 0.034 2.240
S 5.019 2.585 6.261 0.576 26.100
Cl 0.270 0.188 0.247 0.066 1.350
K 10.262 6.165 10.569 0.336 37.700
Ca 1.192 1.045 0.685 0.416 4.130
Mn 0.023 0.020 0.013 0.006 0.053
Fe 0.095 0.088 0.060 0.019 0.288
Cu 0.380 0.240 0.327 0.038 1.320
Zn 0.037 0.023 0.043 0.007 0.194
Br 0.008 0.004 0.012 0.002 0.068
Rb 0.009 0.006 0.010 0.001 0.039
  • Large differences in observation range and scale (up to \(10^4\))
  • Overall: Minimum (Rb) = 0.001, Maximum (K) = 37.7
  • Means well above medians in several variables (e.g. P, S, K), indicating right-skewed distributions
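
The scale difference can be checked directly from the summary table. The following is a minimal sketch that takes the tabulated minima and maxima as given; the variable names are illustrative only.

```python
import math

# (min, max) per element, copied from the summary statistics table.
ranges = {
    "P": (0.034, 2.240), "S": (0.576, 26.100), "Cl": (0.066, 1.350),
    "K": (0.336, 37.700), "Ca": (0.416, 4.130), "Mn": (0.006, 0.053),
    "Fe": (0.019, 0.288), "Cu": (0.038, 1.320), "Zn": (0.007, 0.194),
    "Br": (0.002, 0.068), "Rb": (0.001, 0.039),
}

overall_min = min(low for low, _ in ranges.values())
overall_max = max(high for _, high in ranges.values())

# Ratio between the largest and smallest recorded values across all elements.
scale_ratio = overall_max / overall_min
orders_of_magnitude = math.log10(scale_ratio)

print(f"min={overall_min}, max={overall_max}, "
      f"ratio={scale_ratio:.0f}, log10(ratio)={orders_of_magnitude:.2f}")
```

A spread of more than four orders of magnitude is the usual motivation for a log transformation and for standardising before PCA.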

Initial viewing:

Assessment of Multivariate Normal Distribution

Summary of Surprising Observations
Distance Category Count/HZ %/P-val
Bottom_50% 22 68.8
50-75% 2 6.2
75-90% 0 0.0
90-95% 1 3.1
95-99% 4 12.5
Top_1% 3 9.4
Henze-Zirkler Test 1.325 <0.001

  • Somewhat surprising observations (90-95%) omitted, as a single observation is too few to estimate a density
  • Orange denotes surprising observations (\(0.95 < d_M < 0.99\))
  • Red denotes very surprising observations (\(d_M > 0.99\))
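
The banding above can be illustrated with a small sketch. It is not the analysis behind the table: it uses only two variables (S and K) for the blends and counterfeits, so that the chi-square CDF with 2 degrees of freedom has a closed form, and it assumes the bands are chi-square percentiles of the squared Mahalanobis distance. Function names are hypothetical.

```python
import math

# (S, K) pairs for the eight blends and five counterfeits in the data table.
samples = [
    (1.100, 7.860), (1.580, 4.930), (0.809, 4.310), (1.160, 3.200),
    (1.090, 5.480), (0.615, 2.740), (0.576, 2.360), (0.748, 2.840),
    (4.060, 0.336), (14.700, 1.230), (15.900, 0.811), (22.100, 2.320),
    (26.100, 2.370),
]

# Upper percentile bound and label for each band (labels as in the table).
BANDS = [(0.50, "Bottom_50%"), (0.75, "50-75%"), (0.90, "75-90%"),
         (0.95, "90-95%"), (0.99, "95-99%")]


def squared_mahalanobis(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the closed-form inverse of a 2x2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def chi2_cdf_2df(x):
    """Chi-square CDF for exactly 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0)


def surprise_category(percentile):
    for upper, label in BANDS:
        if percentile < upper:
            return label
    return "Top_1%"


d2 = squared_mahalanobis(samples)
percentiles = [chi2_cdf_2df(v) for v in d2]
categories = [surprise_category(p) for p in percentiles]
for (s, k), pct, cat in zip(samples, percentiles, categories):
    print(f"S={s:6.3f} K={k:6.3f} percentile={pct:.3f} {cat}")
```

With all 11 elements the same idea applies, but the covariance inverse and the chi-square CDF (11 degrees of freedom) would normally come from a statistics library.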

Post log-transformation

Reassessment of Multivariate Normal Distribution

Summary of Surprise Categories
Distance Category Count/HZ %/P-val
Bottom_50% 17 53.1
50-75% 7 21.9
75-90% 5 15.6
90-95% 2 6.2
95-99% 1 3.1
Top_1% 0 0.0
Henze-Zirkler Test 0.984 0.137

  • Orange denotes somewhat surprising observations (\(0.90 < d_M < 0.95\))
  • 9 out of 11 variables displayed univariate normality
  • Zn (\(A^2 = 0.872, p = 0.022\)) and Br (\(A^2 = 1.075, p=0.007\)) displayed significant deviations via Anderson-Darling test
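
As a rough check on the Anderson-Darling figures, the sketch below computes \(A^2\) for normality with estimated mean and SD, and converts a statistic to a p-value using the small-sample adjustment and piecewise approximation of D'Agostino and Stephens (1986). That the reported values came from this approximation is an assumption; it is the one used by common implementations. The Zn values are copied from the data table, and function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def anderson_darling_a2(values):
    """A^2 statistic for normality, mean and SD estimated from the sample."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = sorted((v - mean) / sd for v in values)
    total = 0.0
    for i in range(n):
        # 1 - Phi(z) is evaluated as Phi(-z) for numerical stability.
        total += (2 * i + 1) * (
            math.log(normal_cdf(z[i])) + math.log(normal_cdf(-z[n - 1 - i]))
        )
    return -n - total / n


def ad_p_value(a2, n):
    """Approximate p-value from A^2 (D'Agostino and Stephens approximation)."""
    adj = a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)
    if adj >= 0.6:
        return math.exp(1.2937 - 5.709 * adj + 0.0186 * adj ** 2)
    if adj >= 0.34:
        return math.exp(0.9177 - 4.279 * adj - 1.38 * adj ** 2)
    if adj >= 0.2:
        return 1.0 - math.exp(-8.318 + 42.796 * adj - 59.938 * adj ** 2)
    return 1.0 - math.exp(-13.436 + 101.14 * adj - 223.73 * adj ** 2)


# Zn column of the data table, in sample order.
zn = [0.015, 0.021, 0.007, 0.018, 0.020, 0.007, 0.019, 0.020,
      0.038, 0.018, 0.016, 0.015, 0.194, 0.019, 0.046, 0.041,
      0.011, 0.032, 0.066, 0.057, 0.033, 0.037, 0.022, 0.035,
      0.025, 0.024, 0.029, 0.043, 0.193, 0.029, 0.020, 0.019]

log_zn = [math.log(v) for v in zn]
a2_log_zn = anderson_darling_a2(log_zn)
print(f"A^2 for log(Zn): {a2_log_zn:.3f}, p = {ad_p_value(a2_log_zn, len(zn)):.3f}")

# p-values implied by the reported statistics with n = 32.
p_zn_reported = ad_p_value(0.872, 32)
p_br_reported = ad_p_value(1.075, 32)
print(f"Zn: p = {p_zn_reported:.3f}; Br: p = {p_br_reported:.3f}")
```

Because the data are standardised before the statistic is computed, the base of the logarithm does not affect \(A^2\).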

Structure examination

  • Strong positive correlations between K and Rb, Mn and Rb, and Mn and K
  • Moderate correlations elsewhere
Whisky Class Hotelling's T² Test Results
Comparison T² Statistic P-value
Counterfeit vs Speyside 7,083.10 0.009
Blend vs Speyside 213.48 0.026
Counterfeit vs Blend 928,150.00 0.009
Island vs Speyside 171.63 0.581
Grouped Hotelling's T² Test Results
Comparison T² Statistic P-value
Provenance vs Counterfeit 1,181.90 <0.001
Provenance vs Grain/Blend 137.44 <0.001
Grain/Blend vs Counterfeit 474.57 0.042
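
Very large T² values alongside modest p-values are expected when groups are small relative to the number of variables. The sketch below converts T² to an F statistic under two assumptions not stated in the slides: the standard pooled two-sample form, and all 11 elements as variables. Group sizes are counted from the data table; the function name is hypothetical.

```python
def hotelling_t2_to_f(t2, n1, n2, p):
    """Convert a pooled two-sample Hotelling's T^2 to an F statistic.

    Under the null, F follows an F distribution with (p, n1 + n2 - p - 1)
    degrees of freedom.
    """
    df1 = p
    df2 = n1 + n2 - p - 1
    if df2 < 1:
        raise ValueError("too few observations for this many variables")
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    return f_stat, df1, df2


GROUP_SIZES = {"Counterfeit": 5, "Blend": 8, "Island": 4, "Speyside": 9}
P_VARIABLES = 11  # assumption: all 11 elements were used

comparisons = [
    ("Counterfeit", "Speyside", 7083.10),
    ("Blend", "Speyside", 213.48),
    ("Counterfeit", "Blend", 928150.00),
    ("Island", "Speyside", 171.63),
]

results = {}
for a, b, t2 in comparisons:
    results[(a, b)] = hotelling_t2_to_f(t2, GROUP_SIZES[a], GROUP_SIZES[b], P_VARIABLES)
    f_stat, df1, df2 = results[(a, b)]
    print(f"{a} vs {b}: F({df1}, {df2}) = {f_stat:.2f}")
```

The p-value would then come from the upper tail of the F distribution, which is left to a statistics library here. With only 1 or 2 denominator degrees of freedom, even a very large F gives limited evidence.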

Whiskies of:

Provenance

  • All correlations positive, ranging from weak to strong
  • \(n=17\)

Counterfeit

  • Moderate Fe and other negative correlations
  • \(n=5\)

Blend/Grain

  • Maximum correlations lower than in the other groups
  • \(n= 10\)

PCA Analysis:

  • Approximately 80% of variance is contained within the first three principal components
Standardized Log-data PCA Summary
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
Standard deviation 2.2650 1.5491 1.0852 0.8647 0.6564 0.6074 0.4985 0.4517 0.4210 0.2857 0.1818
Proportion of Variance 0.4664 0.2182 0.1071 0.0680 0.0392 0.0335 0.0226 0.0186 0.0161 0.0074 0.0030
Cumulative Proportion 0.4664 0.6846 0.7916 0.8596 0.8988 0.9323 0.9549 0.9735 0.9896 0.9970 1.0000
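
The proportion rows follow directly from the standard deviations, since each component's variance is its squared standard deviation. A minimal sketch using the tabulated values:

```python
# Standard deviations of PC1..PC11 from the PCA summary table.
sdev = [2.2650, 1.5491, 1.0852, 0.8647, 0.6564, 0.6074,
        0.4985, 0.4517, 0.4210, 0.2857, 0.1818]

variances = [s ** 2 for s in sdev]
total_variance = sum(variances)  # close to 11 for 11 standardised variables

proportion = [v / total_variance for v in variances]

cumulative = []
running = 0.0
for share in proportion:
    running += share
    cumulative.append(running)

for i, (share, cum) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={share:.4f} cumulative={cum:.4f}")
```

The first three components together account for just under 80% of the variance, which is the basis for working in a three-dimensional PC space.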

PCA Loadings

Principal Component Loadings
First Three Components
PC1 PC2 PC3
P 0.311 0.118 −0.196
S 0.094 0.534 0.223
Cl 0.315 0.006 −0.422
K 0.404 −0.161 −0.131
Ca 0.149 0.475 −0.301
Mn 0.379 −0.239 −0.132
Fe 0.310 −0.004 0.469
Cu 0.327 −0.347 0.118
Zn 0.241 0.244 0.570
Br 0.183 0.454 −0.205
Rb 0.415 −0.074 0.088

  • PC1 primarily influenced by positive loadings of P, Cl, K, Mn, Fe, Cu, Rb
  • PC2 primarily influenced by positive loadings of S, Br, Ca and negative loadings of Cu, Mn
  • PC3 primarily influenced by positive loadings of Zn, Fe and negative loadings of Cl, Ca
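
One way to read the loadings table is to flag elements whose absolute loading exceeds a cut-off. The sketch below uses 0.3, which is an illustrative choice and not a threshold taken from the analysis; it also checks that the tabulated PC1 and PC2 vectors are approximately unit length and orthogonal, as PCA loadings should be up to rounding.

```python
ELEMENTS = ["P", "S", "Cl", "K", "Ca", "Mn", "Fe", "Cu", "Zn", "Br", "Rb"]

# Loadings copied from the table, in the element order above.
LOADINGS = {
    "PC1": [0.311, 0.094, 0.315, 0.404, 0.149, 0.379, 0.310, 0.327, 0.241, 0.183, 0.415],
    "PC2": [0.118, 0.534, 0.006, -0.161, 0.475, -0.239, -0.004, -0.347, 0.244, 0.454, -0.074],
    "PC3": [-0.196, 0.223, -0.422, -0.131, -0.301, -0.132, 0.469, 0.118, 0.570, -0.205, 0.088],
}

THRESHOLD = 0.3  # illustrative cut-off, not from the source


def dominant(pc):
    """Elements whose absolute loading on the given component meets the cut-off."""
    return [(el, w) for el, w in zip(ELEMENTS, LOADINGS[pc]) if abs(w) >= THRESHOLD]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


for pc in LOADINGS:
    print(pc, dominant(pc))

pc1_norm_sq = dot(LOADINGS["PC1"], LOADINGS["PC1"])
pc1_pc2 = dot(LOADINGS["PC1"], LOADINGS["PC2"])
print(f"|PC1|^2 = {pc1_norm_sq:.3f}, PC1.PC2 = {pc1_pc2:.3f}")
```

A fixed cut-off is only a reading aid; loadings just below it (for example Mn on PC2) can still matter for interpretation.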

Differentiation within the PCA Space

Linear Discriminant Analysis

  • Shand et al. (2017) conducted an LDA using the given whisky groups within the first 3 principal components
  • Homogeneity of covariance matrices could not be established for the original groups, as the smallest groups (\(n=2\)) had fewer observations than predictors (\(p=3\)), making their covariance matrices singular
  • The correlation plots above suggest differing covariance matrices, at least for counterfeits
  • Using the first three principal components and the aggregated groups, a global Box’s M test indicated significant covariance matrix heterogeneity (\(\chi^2_{12} = 21.706, p = 0.041\)), so an LDA was not pursued
Box’s M-test for Homogeneity of Covariance Matrices
Comparison Chi_Sq df p_value
Global Test 21.706 12 0.041
Provenance vs Counterfeit 11.457 6 0.075
Grain/Blend vs Counterfeit 14.066 6 0.029
Grain/Blend vs Provenance 14.066 6 0.029
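
The degrees of freedom in the table match the usual chi-square approximation for Box's M, \((g-1)\,p(p+1)/2\) for \(g\) groups and \(p\) variables. A short sketch, assuming that approximation was the one used:

```python
def box_m_df(groups, variables):
    """Degrees of freedom of the chi-square approximation to Box's M."""
    return (groups - 1) * variables * (variables + 1) // 2


# Three aggregated groups on the first three principal components.
global_df = box_m_df(groups=3, variables=3)
# Any pairwise comparison on the same three components.
pairwise_df = box_m_df(groups=2, variables=3)

print(f"global df = {global_df}, pairwise df = {pairwise_df}")
```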

K-means Assessment:

  • Elbow plot showing k-means clustering results for \(k = 1, \dots, 10\), using Euclidean distance and the Hartigan–Wong algorithm
  • Dashed red line marks the visually identified optimal cluster number at \(k = 3\), based on the diminishing returns in variance reduction.
  • Dark red shading represents the 37.37% of total variance explained by \(k = 3\) choice relative to further increases in \(k\).
K-means Clustering Comparison
Metric K = 3 K = 4
Cluster Sizes 10, 6, 16 8, 2, 6, 16
Variance Explained 51.6% 58.9%
Avg Silhouette 0.30 0.28
Total Within SS 164.9 140.08
Between SS 176.1 200.92
Total SS 341 341
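
The "Variance Explained" row is the between-cluster sum of squares divided by the total sum of squares. A minimal sketch using the tabulated values:

```python
# Within- and between-cluster sums of squares from the comparison table.
kmeans_ss = {
    3: {"within": 164.9, "between": 176.1},
    4: {"within": 140.08, "between": 200.92},
}

totals = {}
explained = {}
for k, ss in kmeans_ss.items():
    # Total SS decomposes into within-cluster plus between-cluster SS.
    totals[k] = ss["within"] + ss["between"]
    explained[k] = ss["between"] / totals[k]
    print(f"k={k}: total SS={totals[k]:.2f}, variance explained={explained[k]:.1%}")
```

Total SS of 341 equals 11 variables times 31 degrees of freedom, which is consistent with clustering on the 32 standardised observations.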

Silhouettes:

Groupings in PC Space

k = 3

k = 4

PAM Assessment

K = 3

PAM Clustering Results (K=3)
Overall Avg Silhouette: 0.294
Cluster Size Medoid Avg Diss. Separation Avg Silhouette
1 17 4 2.234 2.304 0.344
2 5 10 2.962 2.653 0.138
3 10 18 2.277 2.304 0.287
PAM Clustering Results (K=4)
Overall Avg Silhouette: 0.149
Cluster Size Medoid Avg Diss. Separation Avg Silhouette
1 8 4 1.875 1.839 0.080
2 9 25 1.848 1.839 0.192
3 5 10 2.962 2.653 0.074
4 10 18 2.277 2.304 0.204
  • Results consistent with k-means clustering at \(k=3\)

  • Group 1 Medoid = 4, Blend (Dewars)

  • Group 2 Medoid = 10, Counterfeit

  • Group 3 Medoid = 18, Island (Bowmore)

  • Much poorer performance at \(k=4\)

  • Worsened separation of group 1 and 2

  • Large silhouette reductions at \(k=4\) across groups

  • Due to robustness in performance, set \(k^*=3\)
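
The overall average silhouette is the size-weighted mean of the per-cluster averages, so the headline figures can be recovered from the two PAM tables. A minimal sketch with the tabulated sizes and silhouettes:

```python
# (cluster size, average silhouette) per cluster, from the PAM tables.
pam_clusters = {
    3: [(17, 0.344), (5, 0.138), (10, 0.287)],
    4: [(8, 0.080), (9, 0.192), (5, 0.074), (10, 0.204)],
}


def overall_silhouette(clusters):
    """Size-weighted mean of per-cluster average silhouette widths."""
    n = sum(size for size, _ in clusters)
    return sum(size * width for size, width in clusters) / n


overall = {k: overall_silhouette(c) for k, c in pam_clusters.items()}
for k, value in overall.items():
    print(f"k={k}: overall average silhouette = {value:.3f}")
```

Moving from three to four clusters roughly halves the overall silhouette, which supports the choice of \(k^*=3\).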

PAM, k=3 in Principal Component Space

k = 4

Hierarchical Clustering Assessment:

Replication of Hierarchical Clustering

Comparative quality:

Confusion Matrices

Consensus Clustering Results
K-means (k=3) and hierarchical (Manhattan [complete & Ward] and Euclidean [Ward])
Predicted Counterfeit Grain/Blend Provenance
Counterfeit 5 1 0
Grain_Blend 0 9 7
Provenance 0 0 10
PAM Clustering
Predicted Counterfeit Grain/Blend Provenance
Counterfeit 5 0 0
Grain_Blend 0 10 7
Provenance 0 0 10
Correlation (1-r) Hierarchical Clustering
Predicted Counterfeit Grain/Blend Provenance
Counterfeit 5 1 0
Grain_Blend 0 9 10
Provenance 0 0 7
Euclidean (complete) Hierarchical Clustering
Predicted Counterfeit Grain/Blend Provenance
Counterfeit 5 5 0
Grain_Blend 0 5 15
Provenance 0 0 2

Quality Metrics

Clustering Method Performance Comparison
Global Confusion Matrix Metrics
Method Overall Acc. Average Acc. F1 (Macro) TNR (Macro) F1 (Micro) TNR (Micro) TPR (Micro)
Consensus 0.750 0.833 0.781 0.882 0.758 0.801 0.758
PAM 0.781 0.854 0.827 0.894 0.793 0.835 0.793
Correlation HC 0.656 0.771 0.704 0.836 0.677 0.735 0.677
Euclidean HC 0.375 0.583 0.404 0.711 0.452 0.479 0.452
PAM Class-wise Performance
Counts and Derived Quality Metrics
TP TN FP FN ACC_i MR_i PPV_i TPR_i TNR_i F_class
Counterfeit 5 27 0 0 1.000 0.000 1.000 1.000 1.000 1.000
Grain/Blend 10 15 7 0 0.781 0.219 0.588 1.000 0.682 0.741
Provenance 10 15 0 7 0.781 0.219 1.000 0.588 1.000 0.741
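
The class-wise counts and the macro-averaged metrics can be derived from the PAM confusion matrix. The sketch below assumes rows are predicted labels and columns are true classes, as the "Predicted" header suggests; function and variable names are illustrative.

```python
LABELS = ["Counterfeit", "Grain/Blend", "Provenance"]

# PAM confusion matrix: rows = predicted label, columns = true class.
pam = [
    [5, 0, 0],
    [0, 10, 7],
    [0, 0, 10],
]


def class_metrics(matrix, k):
    """One-vs-rest counts and derived metrics for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                  # predicted k, truly another class
    fn = sum(row[k] for row in matrix) - tp   # truly k, predicted another class
    tn = total - tp - fp - fn
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "ACC": (tp + tn) / total, "PPV": ppv, "TPR": tpr, "TNR": tnr, "F1": f1}


per_class = {label: class_metrics(pam, i) for i, label in enumerate(LABELS)}

n_total = sum(sum(row) for row in pam)
overall_accuracy = sum(pam[i][i] for i in range(len(LABELS))) / n_total
macro_f1 = sum(m["F1"] for m in per_class.values()) / len(LABELS)
macro_tnr = sum(m["TNR"] for m in per_class.values()) / len(LABELS)

for label, m in per_class.items():
    print(label, {key: round(value, 3) for key, value in m.items()})
print(f"overall acc={overall_accuracy:.3f}, macro F1={macro_f1:.3f}, macro TNR={macro_tnr:.3f}")
```

All seven errors are provenance whiskies assigned to the grain/blend cluster, which is why counterfeit detection is perfect while the other two classes share the same accuracy.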

Summary:

  • PAM clustering at \(k = 3\) performed comparably to the LDA of Shand et al. (2017) in identifying counterfeits

  • Strong agreement across clustering methods indicates robustness of the TXRF approach for counterfeit detection in field applications

  • Moderate to weak clustering performance was observed for distinguishing between regional and blended/grain whisky categories